LAMP-TR-032 

CAR-TR-906 

CS-TR-3982 


MDA  9049-6C-1250 
January  1999 


A  Statistical,  Nonparametric  Methodology 
for  Document  Degradation  Model  Validation 


Tapas  Kanungo,1  Robert  Haralick,2 
Henry  Baird,3  Werner  Stuezle,4  and  David  Madigan4 


1  Center  for  Automation  Research 
University  of  Maryland 
College  Park,  MD 

2Dept.  of  Electrical  Engineering 
University  of  Washington 
Seattle,  WA 


3Xerox  PARC 
Palo  Alto,  CA 

4Dept.  of  Statistics 
University  of  Washington 
Seattle,  WA 


Abstract 

Printing, photocopying and scanning processes degrade the image quality of a document.
Statistical models of these degradation processes are crucial for document image
understanding research. Models allow us to predict system performance; conduct
controlled experiments to study the break-down points of the systems; create large
multilingual data sets with groundtruth for training classifiers; design optimal noise removal
algorithms; choose values for the free parameters of the algorithms; and so on. Although
research in document understanding started many decades ago, only two document
degradation models have been proposed thus far. Furthermore, no attempts have been made
to statistically validate these models.

In  this  paper  we  present  a  statistical  methodology  that  can  be  used  to  validate  local 
degradation  models.  This  method  is  based  on  a  non-parametric,  two-sample  permutation 
test.  Another  standard  statistical  device  —  the  power  function  —  is  then  used  to  choose 
between  algorithm  variables  such  as  distance  functions.  Since  the  validation  and  the 
power  function  procedures  are  independent  of  the  model,  they  can  be  used  to  validate 
any  other  degradation  model.  A  method  for  comparing  any  two  models  is  also  described. 
It  uses  p-values  associated  with  the  estimated  models  to  select  the  model  that  is  closer 
to  the  real  world. 


This  research  was  funded  in  part  by  the  Department  of  Defense  and  the  Army  Research  Laboratory 
under  Contract  MDA  9049-6C-1250. 




1  Introduction 


Printing, photocopying and scanning processes degrade the image quality of any
document. Statistically valid models of these degradation processes can impact document
image understanding research in many ways. Degradation models can be used to
conduct controlled experiments to study the breakdown points of OCR systems; create large
multilingual data sets with groundtruth for training classifiers; design optimal noise
removal algorithms; choose values for the free parameters of the algorithms; predict OCR
performance; and so on. Whereas research in document understanding started decades
ago, only two document degradation models have been proposed thus far. Furthermore,
no attempts have been made to statistically validate these models.

The  current  OCR  evaluation  methods  rely  on  scanned  documents,  corresponding 
hand-entered  ASCII  groundtruth  strings,  and  string  matching  algorithms  that  compare 
the  groundtruth  string  against  the  OCR-generated  string.  The  errors  in  the  groundtruth 
are  reduced  by  a  process  of  cross-checking.  This  method  is  very  expensive,  laborious,  and 
prone  to  errors.  Furthermore,  since  the  datasets  are  expensive,  it  is  not  possible  to  create 
large  datasets  that  are  representative  of  the  variety  of  layout,  font,  and  degradation  levels 
seen  in  real-world  documents.  Despite  these  problems,  various  document  databases  with 
groundtruth  have  been  created. 

Our methodology for characterizing OCR algorithms is based on evaluating the
algorithms on synthetically degraded documents. First, a word processor is used to create
an ideal document in any language, format or style. A bitmap version of this document
is then created and degraded using a computer model of the real degradation process.
This method has many advantages. First, since the ideal document is created using a
word processor, the groundtruth information associated with each character (location,
identity, font type, etc.) is known without error. Second, the word processor can be
used to reformat the documents (two columns, one column, various font types, sizes,
etc.) to study the sensitivity of the OCR algorithm to these variables. Third, since the
degradation model is under our control, we can create documents with varying levels of
degradation and study how and where the OCR algorithm breaks down. Fourth, sample
size is not a problem at all: any number of degraded samples can be created since
all that needs to be done is to simulate another set of characters. In addition, there is
no dearth of formatted documents; we create such documents daily, and so do
academic journal publishers. Fifth, the model itself can be used in creating noise removal
algorithms, training classifiers, choosing algorithm parameters, etc.

The  main  drawback  of  the  above  methodology  is  that  it  relies  heavily  on  the  simulation 
model  being  correct.  That  is,  it  assumes  that  the  simulation  model  closely  mimics  reality. 
It  is  thus  imperative  that  we  validate  the  degradation  model  against  real  data.  Only 
then  can  the  simulations  be  used  in  place  of  real  data.  If  the  degradation  model  is  not 
validated,  results  on  the  synthetically  degraded  documents  should  be  used  with  caution, 
though  they  are  still  useful  since  they  give  some  indication  about  the  performance  of 
the  OCR  algorithm. 

In this paper we present a statistical methodology that can be used to validate the
local degradation models. This method is based on a non-parametric, two-sample
permutation test. Another standard statistical device, the power function, is then used
to choose between algorithm variables such as distance functions. Since the validation
and the power function procedures are independent of the model, they can be used to
validate any other degradation model. A method of comparing any two models is also
described. It uses p-values associated with the estimated models to select the model that
is closer to the real world.

In  Section  2  we  survey  the  related  literature  in  the  areas  of  degradation  models, 
model  validation,  and  statistical  hypothesis  testing  and  discuss  their  shortcomings.  In 
Section  3  we  describe  our  document  degradation  model  for  the  local  distortions  that  occur 
while  printing,  photocopying  and  scanning  documents.  The  model  is  independent  of  the 
language  in  which  the  document  is  written.  Our  validation  methodology  is  described  in 
Section  4,  and  in  Section  5  we  apply  it  to  datasets  with  known  distributions  to  understand 
the  performance  of  the  permutation  tests.  In  Section  6  we  give  experimental  protocols 
and  results  of  validation  experiments  on  document  images,  and  in  Section  7  we  present 
conclusions. 

2  Related  Literature 

The earliest work on document degradation models is that of Baird [2, 3, 4].
Unfortunately, his degradation model is not validated. Furthermore, his paper advocates the
use of isolated, synthetically degraded characters. Thus the degradation due to
merging of neighboring characters is not reflected in his model. In addition, the uni-gram
and bi-gram occurrence probabilities of characters in real-world text are not reflected in
isolated-character experiments.

In contrast, our document degradation model, which is described in Section 3,
advocates the use of complete documents for generating synthetically degraded characters. It
thus takes into account the degradations arising due to merging of characters, the
occurrence probabilities of individual characters, and the variability in the layout structure of
the documents. The pixel degradations themselves are based on a local morphological
model, which models the final spatial characteristics of the degradation process rather
than the underlying physical process.

To the best of our knowledge, the only other work on validation of document
degradation models is that of Nagy and Lopresti [23, 21, 22]. They are of the opinion that
a degradation model is valid if the OCR confusion matrices that result from
synthetically degraded documents are similar to the OCR confusion matrices produced from
real documents. Unfortunately, this methodology validates the model-OCR combination
and not the model itself. For instance, if the OCR system automatically filters noise,
their validation process will not detect any difference between the real documents and
the synthetically degraded documents even if the degradation process adds noise to the
document. Furthermore, although they treat the OCR as a black box, the OCR
algorithm itself has many parameters that can greatly influence the decisions of the validation
procedure. Another drawback of their approach is that they do not indicate how their
validation procedure can be compared to other validation procedures.

Our validation method, on the other hand, reduces the problem of model validation
to a nonparametric statistical hypothesis testing problem, which is a well studied and
accepted method in statistics [8, 7]. In addition, we use simple character distance functions
for the validation procedure, instead of entire OCR systems. Although the validation
process now depends on these distance functions, they are much simpler than OCR black
boxes. Finally, we provide a technique for comparing our validation method with other
validation methods. This comparison procedure is based on “power functions,” which
again are standard statistical devices for comparing hypothesis testing procedures.

3  A  Document  Degradation  Model 

In  this  section  we  describe  a  document  degradation  model  for  local  distortions  that  are 
introduced  during  the  printing,  photocopying  and  scanning  processes.  A  model  for  the 
perspective  and  illumination  distortions  that  get  introduced  when  we  photocopy  or  scan 
thick  bound  books  is  described  in  [15,  16,  12]. 

Our local document degradation model accounts for (i) pixel inversion (from
foreground to background and vice versa) that occurs independently at each pixel due to
light intensity fluctuations, sensitivity of the sensors, and the thresholding level, and (ii)
blurring that occurs due to the point-spread function of the scanner optical system. We
model the pixel-flipping probability of a foreground pixel as an exponential function of
its distance from the nearest boundary pixel. The parameter $\alpha_0$ is the initial value for
the exponential and the decay speed of the exponential is controlled by the parameter
$\alpha$. The foreground and background 4-neighbor distances are computed using a standard
distance transform algorithm (see [5]). The flipping probabilities of the background
pixels are similarly controlled by $\beta_0$ and $\beta$. The parameter $\eta$ is the constant probability of
flipping for all pixels. Finally, the last parameter $k$, which is the size of the disk used
in the morphological closing operation, accounts for the correlation introduced by the
point-spread function of the optical system.

The degradation model thus has six parameters: $\Theta = (\eta, \alpha_0, \alpha, \beta_0, \beta, k)^t$. These
parameters are used to degrade an ideal binary image as follows:

1. Compute the distance $d$ of each pixel from the character boundary.

2. Flip each foreground pixel with probability $p(0 \mid 1, d, \alpha_0, \alpha) = \alpha_0 e^{-\alpha d^2} + \eta$.

3. Flip each background pixel with probability $p(1 \mid 0, d, \beta_0, \beta) = \beta_0 e^{-\beta d^2} + \eta$.

4. Perform a morphological closing operation with a disk structuring element of
diameter $k$.
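
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the DDM program itself: the function names (`cityblock_distance`, `closing`, `degrade`) are hypothetical, the distance of a pixel is taken to be its 4-neighbor distance to the nearest pixel of the opposite color, and a k × k square stands in for the disk structuring element.

```python
import math
import random


def cityblock_distance(img, target):
    """4-neighbor distance from every pixel to the nearest pixel equal to `target`."""
    h, w = len(img), len(img[0])
    far = h + w  # larger than any in-image city-block distance
    d = [[0 if v == target else far for v in row] for row in img]
    for r in range(h):                      # forward raster pass
        for c in range(w):
            if r > 0:
                d[r][c] = min(d[r][c], d[r - 1][c] + 1)
            if c > 0:
                d[r][c] = min(d[r][c], d[r][c - 1] + 1)
    for r in reversed(range(h)):            # backward raster pass
        for c in reversed(range(w)):
            if r < h - 1:
                d[r][c] = min(d[r][c], d[r + 1][c] + 1)
            if c < w - 1:
                d[r][c] = min(d[r][c], d[r][c + 1] + 1)
    return d


def closing(img, k):
    """Morphological closing (dilation, then erosion) by a k x k square (assumed shape)."""
    h, w = len(img), len(img[0])
    offsets = [(dr, dc) for dr in range(k) for dc in range(k)]
    # Dilate on a canvas padded by k - 1 so the image border is not eroded away.
    dil = [[0] * (w + k - 1) for _ in range(h + k - 1)]
    for r in range(h):
        for c in range(w):
            if img[r][c]:
                for dr, dc in offsets:
                    dil[r + dr][c + dc] = 1
    # Erode, keeping only the original image domain.
    return [[int(all(dil[r + dr][c + dc] for dr, dc in offsets))
             for c in range(w)] for r in range(h)]


def degrade(img, eta, a0, a, b0, b, k, rng=None):
    """Apply steps 1-4 to a binary image given as a list of rows of 0/1."""
    rng = rng or random.Random()
    d_fg = cityblock_distance(img, 0)   # step 1: foreground pixels vs. nearest background
    d_bg = cityblock_distance(img, 1)   # step 1: background pixels vs. nearest foreground
    flipped = []
    for r, row in enumerate(img):
        new_row = []
        for c, v in enumerate(row):
            if v == 1:                  # step 2: foreground flip probability
                p = a0 * math.exp(-a * d_fg[r][c] ** 2) + eta
            else:                       # step 3: background flip probability
                p = b0 * math.exp(-b * d_bg[r][c] ** 2) + eta
            new_row.append(1 - v if rng.random() < p else v)
        flipped.append(new_row)
    return closing(flipped, k)          # step 4
```

Passing a seeded `random.Random` makes a degraded sample reproducible, which is convenient when the same ideal page is degraded at several parameter settings.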

Software for simulating noisy documents using the above degradation model is
available from the University of Washington English Document Database I, and the model
has also been described in the literature [16, 15]. The application of the various steps
of the model is illustrated in Figure 1. In Figure 1(a) we show an ideal character. The
distance transform of the foreground of (a) is shown in (b). The brighter pixels are
further away from the character boundary. The distance transform of the background is shown
in Figure 1(c). The ideal image after its pixels have been flipped according to the model
is shown in (d). The final image after the closing operation is shown in (e).

Figure 1: Local document degradation model: (a) Ideal noise-free character; (b) Distance
transform of the foreground; (c) Distance transform of the background; (d) Result of
the random pixel-flipping process (the probability of a pixel flipping is
$p(0 \mid d, \alpha, f) = p(1 \mid d, \beta, b) = \alpha_0 e^{-\alpha d^2}$; here $\alpha = \beta = 2$, $\alpha_0 = \beta_0 = 1$);
(e) Morphological closing of the result in (d) by a 2 × 2 binary structuring element.

The procedure described above works on bit-mapped images. Since there is no
restriction on the size of the image that can be degraded, or the language of the written text, an
entire document page image can be degraded using this model. In fact, since typesetting
is under the experimenter's control, the same text can be re-formatted in various styles
(single column, multiple column, report, book, etc.), font types (Roman, Helvetica, etc.),
and font sizes (9pt, 10pt, 12pt, etc.). Thus the performance of any character recognition
system can be studied by providing as input the same (or different) text formatted in
various styles with varied but controlled degradation.

We now show examples where we degrade complete document pages using our
degradation model. In Figure 2(a), we show an ideal document formatted in LaTeX using the
IEEE Transactions typesetting style. In Figure 2(b) we show a degraded version of the
document in Figure 2(a).

The noise-free documents are typeset using the LaTeX formatting system [20, 19].
The ASCII files containing the text and the LaTeX typesetting information are then
converted into a device-independent format (DVI) using LaTeX. A software program
called DVI2TIFF, which is a modified version of a DVI file previewer called XDVI
[24], is run to produce one bit/pixel binary images in TIFF format from the DVI files.
Besides producing the binary images of the documents, DVI2TIFF also produces the
groundtruth information regarding each character in the document image.

The local document degradation model is another software program called DDM.
This program takes as input an ideal binary document image in TIFF format, and a file
containing the degradation model parameter values, and produces the binary degraded
images in TIFF format.

Figure 2: (a) An ideal document page typeset using LaTeX and IEEE Transactions
typesetting style. (b) A synthetically degraded version of the document in (a).

Both programs, DVI2TIFF and DDM, are implemented using the C language
and have been tested on SUN and IBM machines running the UNIX operating system.
The software is available on the UW CD-ROM-1 [9].

4  Model  Validation  and  Parameter  Estimation 
4.1  Statistical  Problem  Definition 

In this section we formulate the degradation model validation problem as a statistical
problem. Although degradation of the document is over the entire page, the
degradation process itself is local. That is, degradation in one region does not influence the
degradation process in another sufficiently distant region. More precisely, the
degradation at a pixel is influenced only by pixels within a local neighborhood. Thus, one way
to characterize the degradation process is to study the degradation of local patterns.
Since the most common patterns that occur on a document page are characters, we
statistically characterize the degradation of individual characters on the page and use this
characterization to estimate the parameters of a degradation model that produces similar
degradations.

Assume that a scanned character is represented by a $30 \times 30$ matrix of zeros and
ones. This matrix can be represented as a $1000 \times 1$ vector $x$ ($30 \times 30 \approx 1000$). Let
$B$ be the space of $D = 1000$-dimensional binary vectors, that is, $B = \{0, 1\}^D$. Now,
let $x_1, x_2, \ldots, x_N \in B$ be independent and identically distributed $D$-dimensional vectors
representing instances of degraded characters produced from the same class $\omega$. That is,
each $x_i$ is a degraded character that is produced from the same ideal pattern $\omega$ (say the
ideal character ‘e’) and the same degradation process. The validation problem we need
to address is:

Suppose we are given a set of real degraded instances $x_1, \ldots, x_N \in B$ of the
pattern $\omega$ and another set of synthetic degraded instances $y_1, \ldots, y_M \in B$ of
the pattern $\omega$. Test the null hypothesis that $y_1, \ldots, y_M$ and $x_1, \ldots, x_N$ are
samples taken from the same underlying population, to a specified significance
level $\epsilon$.

In our case $D$ is large, typically on the order of 1000. Thus the number of possible
$x_i$'s is $2^{1000}$, which is approximately equal to $10^{300}$, a dauntingly large number. The
available sample size, $N$, is typically on the order of 1000. Thus the $x_i$'s occupy the
space $B$ extremely sparsely, which implies that the density function cannot be estimated
reliably from the sample. This fact prohibits us from performing any standard statistical
test based on density estimates. In the next section we describe a nonparametric method
that overcomes this problem.
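
The sparsity argument can be checked with a short calculation. The following is a worked version of the figures quoted above, using $\log_{10} 2 \approx 0.30103$ and taking $N = 10^3$ as the sample size:

```latex
\[
|B| = 2^{1000} = 10^{1000 \log_{10} 2} \approx 10^{301},
\qquad
\frac{N}{|B|} \approx \frac{10^{3}}{10^{301}} = 10^{-298}.
\]
```

Even a sample a million times larger would cover a vanishing fraction of $B$, which is why a histogram-style density estimate over $B$ is out of reach.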

4.2  Permutation  Tests  and  Model  Validation 

In this section we describe a nonparametric validation procedure that can be used to
statistically validate any document degradation model. Suppose we are given a set of
real degraded characters $X = \{x_1, x_2, \ldots, x_N\}$, and another set of artificially degraded
characters $Y = \{y_1, y_2, \ldots, y_M\}$ that were created by perturbing an ideal character with
a document degradation model. We can assume that the characters $x_i$ and $y_i$ are binary
matrices of size (approximately) $30 \times 30$. Note that every $x_i$ and $y_i$ can be of different
size because the scanned characters can be of different sizes. The question that needs
to be addressed is whether or not the $x_i$'s and the $y_i$'s come from the same underlying
population. At this point it does not matter where the $x_i$'s and the $y_i$'s came from; they
could both be synthetically generated, or both be real instances, or one of them could be
synthetic and the other real. A statistical hypothesis test can be performed to test the
null hypothesis that the underlying population distributions of the $x_i$'s and $y_i$'s are the
same.

Standard parametric hypothesis testing procedures are not usable for our problem
because the sizes of the binary matrices $x_i$ and $y_i$ are not fixed. Furthermore, the size
of the space to which they belong is very large (approximately $2^{900}$ if we assume each
character to be of size $30 \times 30$) and so while in principle it is possible to estimate the
density function, in practice it is not possible to do so because of the small sample size.
Instead, we now describe a nonparametric permutation test (see [8, 7]) that performs this
hypothesis test.

1. Given (i) the real data $X = \{x_1, x_2, \ldots, x_N\}$, (ii) the synthetic data $Y = \{y_1, y_2, \ldots, y_M\}$,
(iii) a distance function $\rho(X, Y)$ on sets, (iv) a distance function $\delta(x, y)$ on
characters, and (v) the size $\epsilon$ of the test (i.e. misdetection rate $= \epsilon$).

2. Compute $d_0 = \rho(X, Y)$.

3. Create a new sample $Z = \{x_1, \ldots, x_N, y_1, \ldots, y_M\}$. Thus $Z$ has $N + M$ elements
labeled $z_i$, $i = 1, \ldots, N + M$.

4. Randomly permute (reorder) $Z$.

5. Partition the set $Z$ into two sets $X'$ and $Y'$ where $X' = \{z_1, \ldots, z_N\}$ and
$Y' = \{z_{N+1}, \ldots, z_{N+M}\}$.

6. Compute $d_i = \rho(X', Y')$.

7. Repeat steps 4, 5 and 6 $K$ times to get $K$ distances $d_1, \ldots, d_K$.

8. Compute the empirical distribution of the $d_i$'s: $P(d > v) = \#\{k \mid d_k > v\} / K$.

9. Compute the p-value: $\epsilon_0 = P(d > d_0)$.

10. Reject the null hypothesis that the two samples come from the same population if
$\epsilon_0 < \epsilon$.
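
The procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the names `hamming`, `mean_nearest_neighbor` and `permutation_test` are hypothetical, the Hamming distance for $\delta$ and the symmetric mean nearest-neighbor distance for $\rho$ are stand-ins chosen only for the example, and characters are assumed to be equal-length binary tuples. One further assumption is that permuted distances tied with $d_0$ are counted toward the p-value; this is the conservative convention when the statistic is discrete and ties are common.

```python
import random


def hamming(x, y):
    """Character distance delta: number of differing pixels (equal lengths assumed)."""
    return sum(a != b for a, b in zip(x, y))


def mean_nearest_neighbor(X, Y, delta=hamming):
    """Set distance rho: symmetric mean nearest-neighbor distance (illustrative choice)."""
    def one_way(A, B):
        return sum(min(delta(a, b) for b in B) for a in A) / len(A)
    return 0.5 * (one_way(X, Y) + one_way(Y, X))


def permutation_test(X, Y, rho, K=200, rng=None):
    """Return the permutation p-value for the null hypothesis that X and Y
    come from the same population."""
    rng = rng or random.Random()
    d0 = rho(X, Y)                  # step 2: observed distance
    Z = list(X) + list(Y)           # step 3: pooled sample
    n = len(X)
    hits = 0
    for _ in range(K):              # step 7: repeat K times
        rng.shuffle(Z)              # step 4: random permutation
        d = rho(Z[:n], Z[n:])       # steps 5-6: re-partition and recompute
        if d >= d0:                 # ties counted (assumption, see lead-in)
            hits += 1
    return hits / K                 # steps 8-9: empirical tail probability
```

A caller would reject the null hypothesis when the returned value is below the chosen size, for example `permutation_test(X, Y, mean_nearest_neighbor) < 0.05`.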


Figure 3: Here we show how the nonparametric test works when the two samples X and
Y are from arbitrary distributions. For our problem, $x_i$ and $y_i$ are binary characters. In
this case the null distribution cannot be determined theoretically.

Figure 4: This figure shows the permutation procedure for computing the null
distributions.

The above procedure computes the null distribution of the distance function $\rho(X, Y)$
nonparametrically. In a standard parametric hypothesis-testing procedure, the forms of
the distributions of x and y are known (usually assumed to be Gaussian) and so the
null distribution of the test statistic $\rho(X, Y)$ is known. In contrast, the permutation test
does not make any prior assumption regarding the distributions of x and y. Instead, an
empirical null distribution is created by randomly permuting the data set Z and creating
a histogram of computed test statistics (the $d_i$'s).
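
The ten steps above can be sketched in a few lines of Python. This is only an illustrative
sketch under stated assumptions, not the implementation used for the experiments: the
function name `permutation_test`, its arguments, and the toy statistic `abs_mean_diff`
are hypothetical, and the samples are plain Python sequences rather than character bitmaps.

```python
import numpy as np


def permutation_test(X, Y, rho, K=1000, eps=0.05, seed=0):
    """Two-sample permutation test following steps 1-10 of the procedure.

    X, Y : sequences of samples (any objects that `rho` understands)
    rho  : set distance, rho(X, Y) -> float
    Returns (d0, p_value, reject).
    """
    rng = np.random.default_rng(seed)
    X, Y = list(X), list(Y)
    N = len(X)
    d0 = rho(X, Y)                         # step 2: observed distance
    Z = X + Y                              # step 3: pooled sample
    d = np.empty(K)
    for k in range(K):                     # step 7: repeat K times
        idx = rng.permutation(len(Z))      # step 4: random reordering
        Xp = [Z[i] for i in idx[:N]]       # step 5: first N elements
        Yp = [Z[i] for i in idx[N:]]       #         remaining M elements
        d[k] = rho(Xp, Yp)                 # step 6: permuted distance
    p_value = float(np.mean(d >= d0))      # steps 8-9: P(d >= d0)
    return d0, p_value, p_value < eps      # step 10: reject if p < eps


def abs_mean_diff(X, Y):
    """Toy set distance on scalars (a stand-in for rho on characters)."""
    return abs(float(np.mean(X)) - float(np.mean(Y)))
```

For example, `permutation_test(range(10), range(100, 110), abs_mean_diff, K=200)` compares
two clearly separated scalar samples; any set distance on character bitmaps can be passed
as `rho` in the same way.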

By design, the size of the test, $\epsilon$, is fixed. Thus, irrespective of the distance
function $\rho(X, Y)$, the percentage of time that the validation procedure rejects a true null
hypothesis that the two samples are from the same underlying population is $\epsilon$. In other
words, the probability of misdetection is $\epsilon$. What is not fixed is the probability of false
alarm, $\gamma$, which is the probability that the procedure claims that X and Y come from the
same underlying population when, in fact, they come from different underlying populations.
Although the use of various distance functions for $\rho$ and $\delta$ gives rise to the same
probability of misdetection $\epsilon$, each has a different probability of false alarm $\gamma$.

It  is  important  to  note  that  if  two  samples  X  and  Y  pass  the  validation  procedure, 
this  does  not  mean  that  we  accept  the  null  hypothesis.  Rather,  it  means  that  we  do 
not  have  enough  statistical  evidence  to  reject  the  null  hypothesis.  Nevertheless,  when 
we  reject  a  null  hypothesis,  this  does  mean  that  we  have  enough  statistical  evidence  to 
reject  it. 

4.3  Power  Functions 

Let us assume that the $x_i$'s are distributed as $F(\theta_X)$ and the $y_i$'s are distributed
as $F(\theta_Y)$, where $\theta_X$ and $\theta_Y$ are the parameters of the distributions. Let the
null hypothesis $H_N$ and the alternate hypothesis $H_A$ be

    $H_N : \theta_X = \theta_Y$        (1)

    $H_A : \theta_X \neq \theta_Y$     (2)

The size of the test, $\epsilon$, is fixed by the algorithm designer and is given as

    $\epsilon = P(H_A \mid H_N \text{ is true})$.    (3)

The plot of 1 minus the probability of false alarm as a function of $\theta$ is the power function.
Thus, if we fix the distribution parameter of the $x_i$'s at $\theta_X = \theta_0$, and vary the
distribution parameter value $\theta_Y = \theta$ for the $y_i$'s, the power function is denoted by
$\gamma_{\theta_0}(\theta)$, and is given by

    $\gamma_{\theta_0}(\theta) = P(H_A \mid \theta_X = \theta_0 \text{ and } \theta_Y = \theta)$.    (4)

Thus $1 - \gamma_{\theta_0}(\theta)$ is the probability of false alarm. The power function should
have a minimum at $\theta_X = \theta_Y = \theta_0$, with $\gamma_{\theta_0}(\theta_0) = \epsilon$, and
should increase on either side and go up to 1 when $\theta_Y = \theta$ is very far from $\theta_0$.

Let us say there are two validation schemes A and B with test size $\epsilon$ and power
functions $\gamma^A_{\theta_0}(\theta)$ and $\gamma^B_{\theta_0}(\theta)$. Since the misdetection
probability $\epsilon$ is the same for both schemes, A is better than B if the false alarm rate of
A is less than the false alarm rate of B. That is, A is better than B if
$1 - \gamma^A_{\theta_0}(\theta) < 1 - \gamma^B_{\theta_0}(\theta)$, or
$\gamma^A_{\theta_0}(\theta) > \gamma^B_{\theta_0}(\theta)$. If this relation
is true for all values of $\theta$, the procedure A is said to be uniformly more powerful than B.
That is, scheme A is better than scheme B if the power function plot of A is above the
power function plot of B for all values of $\theta$. Generalizing, if there are many validation
schemes, the one whose power function is above all other power functions is the best
scheme. If the power functions intersect, there is no clear winner; for some regions in the
parameter space one scheme is better while in other regions the other scheme is better.


Figure 5: The true parameter of the sample X is $\theta_X$. The parameter $\theta_Y$ of the sample
Y is varied and the corresponding probability of the test rejecting the null hypothesis
that X and Y are from the same underlying distribution is plotted. The resulting curve
is the power function.

For a given validation scheme, if we increase the sample sizes N and M, the power
function changes and the new power function is higher than the old power function, and
so by definition is more powerful. Thus the sensitivity, i.e., the width of the notch at the
minimum, is a function of the sample sizes N and M. When the sample size is small,
the notch is broader, and when the sample size is large, the notch is sharper. This fact
is used in deciding what sample size should be used for the test: choose the sample size
such that the desired probability of false alarm is attained when the parameters $\theta_X$ and
$\theta_Y$ differ by a specified amount $\Delta\theta$.

Finally, our validation scheme described in the previous section is dependent on two
distance functions $\rho$ and $\delta$. Thus, each choice of $\rho$ and $\delta$ gives rise to a different
power function. The combination that produces the highest power function is the best choice.
See [1] for details on power functions.
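
A power function of this kind can be estimated by Monte Carlo: for each value of $\theta$,
draw fresh samples, run the permutation test, and record the fraction of rejections. The
sketch below is a hedged illustration only. The names `power_function`, `perm_reject`,
`gaussian_sampler`, and `mean_gap` are hypothetical, and a scalar Gaussian location model
stands in for the degradation model.

```python
import numpy as np


def perm_reject(x, y, stat, K, eps, rng):
    """Run one permutation test; return True if the null is rejected."""
    d0 = stat(x, y)
    z = np.concatenate([x, y])
    n = len(x)
    hits = 0
    for _ in range(K):
        p = rng.permutation(z)               # shuffled copy of the pooled data
        hits += stat(p[:n], p[n:]) >= d0
    return hits / K < eps                    # p-value below the test size


def power_function(sampler, stat, theta0, thetas, N, M,
                   K=200, T=50, eps=0.05, seed=0):
    """Estimate gamma_{theta0}(theta) as the reject rate over T trials."""
    rng = np.random.default_rng(seed)
    power = []
    for theta in thetas:
        rejects = sum(
            perm_reject(sampler(theta0, N, rng), sampler(theta, M, rng),
                        stat, K, eps, rng)
            for _ in range(T)
        )
        power.append(rejects / T)
    return np.array(power)


def gaussian_sampler(theta, n, rng):
    """Hypothetical stand-in for the degradation model: N(theta, 1) draws."""
    return rng.normal(theta, 1.0, size=n)


def mean_gap(x, y):
    """Toy test statistic: absolute difference of sample means."""
    return abs(x.mean() - y.mean())
```

Calling `power_function(gaussian_sampler, mean_gap, 0.0, thetas, N=20, M=20)` for a grid
`thetas` around 0 should trace a notch-shaped curve; repeating with larger N and M gives a
sharper notch, which is the basis of the sample-size choice described above.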


4.4  Distance  Functions,  Outliers,  and  Robust  Statistics 


Various distance functions $\rho(X, Y)$ can be used for computing the distance between the
sets of characters X and Y. We use the following symmetric distance functions for $\rho$.

Mean Nearest Neighbor Distance:

    $\rho(X, Y) = \rho_{Mean}(X, Y) = \frac{\rho_{Mean}(Y; X) + \rho_{Mean}(X; Y)}{N + M}$

where

    $\rho_{Mean}(Y; X) = \sum_{x \in X} \min_{y \in Y} \delta(x, y)$

    $\rho_{Mean}(X; Y) = \sum_{y \in Y} \min_{x \in X} \delta(x, y)$


Trimmed Mean Nearest Neighbor Distance:

    $\rho(X, Y) = \rho_{Trim}(X, Y) = \frac{\rho_{Trim}(Y; X) + \rho_{Trim}(X; Y)}{2}$

where

    $\rho_{Trim}(Y; X) = \mathrm{Trim}_{x \in X} \min_{y \in Y} \delta(x, y)$

    $\rho_{Trim}(X; Y) = \mathrm{Trim}_{y \in Y} \min_{x \in X} \delta(x, y)$

Here the Trim function accepts as input a set of real numbers, orders them, and then
discards the top and bottom 10% and returns the mean of the remaining 80%.

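Median Nearest Neighbor Distance:

    $\rho(X, Y) = \rho_{Med}(X, Y) = (\rho_{Med}(Y; X) + \rho_{Med}(X; Y)) / 2$

where

    $\rho_{Med}(Y; X) = \mathrm{Median}_{x \in X} \min_{y \in Y} \delta(x, y)$

    $\rho_{Med}(X; Y) = \mathrm{Median}_{y \in Y} \min_{x \in X} \delta(x, y)$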

Notice that the mean nearest neighbor distance is not a robust distance measure. That
is, if for some reason a data point is far from the norm, the p-value computation becomes
very sensitive to this data point. This can occur, for example, when a character in the
real data set X is actually a ‘c’ (instead of being an ‘e’), and is identified incorrectly as
an ‘e’. Yet another outlier source is connected characters: when characters are extracted
from a real document, they may touch other characters, pieces of which might slip in.
The median and the trimmed mean distance measures are robust against outliers since
they do not look at the tails of the distribution. One would expect that these measures
should work better in cases where there are outliers.

[Figure 6 panel labels: $\rho(X; Y) = (u_1 + u_2 + \ldots + u_4)/4$;
$\rho(Y; X) = (v_1 + v_2 + \ldots + v_5)/5$; $\rho(X, Y) = (\rho(X; Y) + \rho(Y; X))/2$.]

Figure 6: The black dots are elements of the set X and the white dots are elements of
the set Y. In the figure on the left, the distance $\rho(X; Y)$ from Y to X is computed by
summing the distance of each $y_i$ to the nearest $x_j$. Similarly on the right the distance
$\rho(Y; X)$ is computed. The final symmetric distance $\rho(X, Y)$ is computed by taking the
mean.

The distance function $\delta(x, y)$ mentioned earlier is the distance between two individual
characters x and y. We use the Hamming distance for $\delta$. This is computed by counting
the number of pixels where the characters x and y differ after the centroids of x and y
have been registered. A variety of other character distances $\delta(x, y)$ and set distance
functions $\rho(X, Y)$ could have been used (e.g. the Hausdorff distance, rank-ordered Hausdorff
distance, etc.). The combination of character distance $\delta(x, y)$ and set distance $\rho(X, Y)$
that gives rise to the best power function is the best pair of character and set distances
to use for the validation procedure.
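
The character and set distances of this section can be sketched as follows. This is an
illustration under explicit assumptions rather than the code used in the experiments:
centroids are rounded to whole pixels before registration, the mean distance pools the
nearest-neighbor sums over $N + M$, and all function names (`hamming`, `nn_dists`,
`trim_mean`, `rho_mean`, `rho_trim`, `rho_median`) are hypothetical.

```python
import numpy as np


def hamming(a, b):
    """delta(x, y): count of differing pixels after aligning centroids.

    Assumption: centroids are rounded to whole pixels before alignment.
    Both bitmaps must contain at least one foreground pixel.
    """
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    half = max(a.shape + b.shape)  # canvas large enough for either bitmap

    def centred(img):
        rows, cols = np.nonzero(img)
        r0, c0 = int(round(rows.mean())), int(round(cols.mean()))
        canvas = np.zeros((2 * half + 1, 2 * half + 1), dtype=bool)
        canvas[rows - r0 + half, cols - c0 + half] = True
        return canvas

    return int(np.count_nonzero(centred(a) ^ centred(b)))


def nn_dists(A, B, delta=hamming):
    """For each element of A, the distance to its nearest neighbor in B."""
    return np.array([min(delta(a, b) for b in B) for a in A], dtype=float)


def trim_mean(v, frac=0.10):
    """Drop the top and bottom `frac` of the sorted values, then average."""
    v = np.sort(np.asarray(v, dtype=float))
    k = int(frac * len(v))
    return v[k:len(v) - k].mean()


def rho_mean(X, Y, delta=hamming):
    total = nn_dists(X, Y, delta).sum() + nn_dists(Y, X, delta).sum()
    return total / (len(X) + len(Y))


def rho_trim(X, Y, delta=hamming):
    return 0.5 * (trim_mean(nn_dists(X, Y, delta)) + trim_mean(nn_dists(Y, X, delta)))


def rho_median(X, Y, delta=hamming):
    return 0.5 * (np.median(nn_dists(X, Y, delta)) + np.median(nn_dists(Y, X, delta)))
```

Because each set distance only calls `delta` through `nn_dists`, a different character
distance (for instance a Hausdorff-type distance) can be swapped in without touching the
set-distance code, which makes it easy to compare power functions across pairs.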

5  Null  Distribution  for  Gaussian  Populations 

In this section we compute the null distributions of two set distances $\rho(X, Y)$ when $x_i$ and
$y_i$ are Gaussian distributed. We show that when they are each Gaussian distributed with a
known variance $\sigma^2$, the two distance functions considered are $\chi^2$ distributed under the null
hypothesis. Such closed-form solutions for the null distributions are possible only when
the underlying distributions are known a priori. However, this is not the case in general:
the Gaussian assumptions might be appropriate in some settings but completely wrong
in other settings. Thus the non-parametric permutation method described in Section 4 is
a much better approach to computing the null distributions when the forms of the sample
distributions are not known. Nevertheless, for the purpose of validating the software and
algorithm for computing the empirical null distribution, the Gaussian case is very useful
since it allows us to compare the empirical distributions against known (theoretically
computed) distributions.

5.1  Inter-Cluster  Mean  Distance

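Let $X = \{x_1, x_2, \ldots, x_N\}$ be a set such that $x_i \in R$ and $x_i \sim N(\mu_X, \sigma^2)$.
Similarly, let $Y = \{y_1, y_2, \ldots, y_N\}$ be a set such that $y_i \in R$ and
$y_i \sim N(\mu_Y, \sigma^2)$. The problem is to test the null hypothesis that $\mu_X = \mu_Y$ when
$\sigma^2$ is known.

Now we know that

    $\hat{\mu}_X = \frac{1}{N} \sum_{i=1}^{N} x_i \sim N(\mu_X, \sigma^2/N)$    (5)

    $\hat{\mu}_Y = \frac{1}{N} \sum_{i=1}^{N} y_i \sim N(\mu_Y, \sigma^2/N)$.    (6)

Therefore

    $\hat{\mu}_X - \hat{\mu}_Y \sim N(\mu_X - \mu_Y, 2\sigma^2/N)$    (7)

and

    $\sqrt{N/2}\,(\hat{\mu}_X - \hat{\mu}_Y)/\sigma \sim N(\sqrt{N/2}\,(\mu_X - \mu_Y)/\sigma, 1)$.    (8)

Now let

    $t = \rho(X, Y) = \frac{N}{2\sigma^2} (\hat{\mu}_X - \hat{\mu}_Y)^2$.

Thus under the null hypothesis that $\mu_X = \mu_Y$, we have

    $t = \rho(X, Y) \sim \chi^2_1$.    (9)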

Thus, instead of empirically computing the distributions as described in Section 4 we
can use the above analytic form of the distribution to accept or reject the null hypothesis.
Moreover, we see that the empirical method has reduced to a standard statistical
technique when the underlying distribution is known to be Gaussian.
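
As a quick numerical check of equation (9), one can simulate the statistic under the null
hypothesis and compare it with the $\chi^2_1$ distribution. The sketch below is illustrative
only; the values of `N`, `sigma`, the common mean, and the number of trials are arbitrary
assumptions, and the 0.95 quantile of $\chi^2_1$ is hard-coded as the square of the standard
normal 0.975 quantile.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma, trials = 25, 2.0, 20000        # arbitrary illustrative settings

# Draw both samples from the same Gaussian, i.e. under the null hypothesis.
x = rng.normal(3.0, sigma, size=(trials, N))
y = rng.normal(3.0, sigma, size=(trials, N))

# t = N / (2 sigma^2) * (mean(x) - mean(y))^2, one value per simulated trial.
t = N / (2.0 * sigma**2) * (x.mean(axis=1) - y.mean(axis=1)) ** 2

crit = 1.959963984540054 ** 2            # 0.95 quantile of chi-square, 1 d.o.f.
mean_t = t.mean()                        # chi-square(1) has mean 1
var_t = t.var()                          # chi-square(1) has variance 2
reject_rate = np.mean(t > crit)          # should be near the test size 0.05

print(f"mean={mean_t:.3f} var={var_t:.3f} reject_rate={reject_rate:.3f}")
```

Rejecting whenever `t > crit` is the analytic counterpart of the permutation test at
$\epsilon = 0.05$ for this Gaussian, known-variance setting.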

5.2  Likelihood  Distance 

In the previous section we picked a particular distance function $\rho(X, Y)$ and showed
that its null distribution is $\chi^2_1$. In this section we pick a distance function based on the
likelihood function of the data. It turns out that this distance function is the same as
the one used in the previous section.

Let $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i \in R$, and $x_i \sim N(\mu_X, \sigma^2)$. Similarly, let
$Y = \{y_1, y_2, \ldots, y_N\}$, where $y_i \in R$, and $y_i \sim N(\mu_Y, \sigma^2)$. The problem is to
test the null hypothesis that $\mu_X = \mu_Y = \mu$.

Let $\rho_Y(X)$ denote the distance of set X from set Y. Here we use a function of the
likelihood for $\rho$:

    $\rho_X(Y) = f(P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma))$    (10)

    $\rho_Y(X) = f(P(x_1, \ldots, x_N \mid y_1, \ldots, y_N, \sigma))$.    (11)

In general, the above distances need not be symmetric in X and Y. Hence we also consider
symmetric distances of the form

    $\rho(X, Y) = f(P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma) \, P(x_1, \ldots, x_N \mid y_1, \ldots, y_N, \sigma))$.    (12)

We can also consider the right-hand side in the equation above divided by
$\log \max_{\mu} P(x_1, \ldots, x_N, y_1, \ldots, y_N \mid \mu, \sigma)$. That is,

    $\rho(X, Y) = f\left( \frac{P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma) \, P(x_1, \ldots, x_N \mid y_1, \ldots, y_N, \sigma)}{\log \max_{\mu} P(x_1, \ldots, x_N, y_1, \ldots, y_N \mid \mu, \sigma)} \right)$.    (13)

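We can use the standard rules of probability theory to manipulate the above equation
as follows:

    $P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma)$

    $= \int_{-\infty}^{\infty} P(\mu, y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma) \, d\mu$

    $= \int_{-\infty}^{\infty} \frac{P(y_1, \ldots, y_N, x_1, \ldots, x_N, \mu, \sigma)}{P(x_1, \ldots, x_N, \sigma)} \, d\mu$

    $= \int_{-\infty}^{\infty} \frac{P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \mu, \sigma) \, P(x_1, \ldots, x_N, \mu, \sigma)}{P(x_1, \ldots, x_N, \sigma)} \, d\mu$

    $= \int_{-\infty}^{\infty} \frac{P(y_1, \ldots, y_N \mid \mu, \sigma) \, P(x_1, \ldots, x_N \mid \mu, \sigma) \, P(\mu, \sigma)}{\int_{-\infty}^{\infty} P(x_1, \ldots, x_N \mid \lambda, \sigma) \, P(\lambda, \sigma) \, d\lambda} \, d\mu$.    (14)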


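Now we make the assumption that $\mu$ and $\sigma$ are independent so that
$P(\mu, \sigma) = P(\mu) P(\sigma)$. Furthermore we assume that $\mu$ and $\sigma$ have a uniform
prior. Although this implies the prior is improper (since its integral is not equal to 1), the
posterior distribution integrates to 1. Thus $P(\mu, \sigma) = P(\mu) P(\sigma) \propto c$. But the
c in the numerator and the denominator of equation (14) cancel out and the numerator can
now be written as follows:

    $P(y_1, \ldots, y_N \mid \mu, \sigma) \, P(x_1, \ldots, x_N \mid \mu, \sigma) \, P(\mu, \sigma)$

    $\propto \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{N} e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mu)^2} \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{N} e^{-\frac{1}{2\sigma^2} \sum_{j=1}^{N} (x_j - \mu)^2}$

    $= \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{2N} e^{-\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N} (y_i - \mu)^2 + \sum_{j=1}^{N} (x_j - \mu)^2 \right]}$.    (15)

Since the denominator is not a function of either $\mu$ or $y_1, \ldots, y_N$, it is a constant.
The denominator can be computed by integrating out $\mu, y_1, \ldots, y_N$ from the probability
density in equation (15). Thus

    $P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma) = C \int_{-\infty}^{\infty} \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{2N} e^{-\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N} (y_i - \mu)^2 + \sum_{j=1}^{N} (x_j - \mu)^2 \right]} d\mu$    (16)

where the constant C can be found by requiring the right-hand side to integrate to 1
over $y_1, \ldots, y_N$. In order to compute the integral, we simplify the exponent inside the
integral: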


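    $\sum_{i=1}^{N} (y_i - \mu)^2 + \sum_{j=1}^{N} (x_j - \mu)^2$

    $= \sum_{i=1}^{N} (y_i - \bar{y} + \bar{y} - \mu)^2 + \sum_{j=1}^{N} (x_j - \bar{x} + \bar{x} - \mu)^2$

    $= \sum_{i=1}^{N} (y_i - \bar{y})^2 + 2 \sum_{i=1}^{N} (y_i - \bar{y})(\bar{y} - \mu) + N (\bar{y} - \mu)^2$
    $\quad + \sum_{j=1}^{N} (x_j - \bar{x})^2 + 2 \sum_{j=1}^{N} (x_j - \bar{x})(\bar{x} - \mu) + N (\bar{x} - \mu)^2$

    $= \sum_{i=1}^{N} (y_i - \bar{y})^2 + \sum_{j=1}^{N} (x_j - \bar{x})^2 + N (\bar{y} - \mu)^2 + N (\bar{x} - \mu)^2$.    (17)

The cross terms vanish because $\sum_i (y_i - \bar{y}) = \sum_j (x_j - \bar{x}) = 0$. Next,

    $(\bar{y} - \mu)^2 + (\bar{x} - \mu)^2 = \bar{x}^2 + \bar{y}^2 + 2\mu^2 - 2\mu(\bar{x} + \bar{y})$

    $= \bar{x}^2 + \bar{y}^2 + 2 \left[ \mu^2 - 2\mu \left( \frac{\bar{x} + \bar{y}}{2} \right) \right]$

    $= \frac{\bar{x}^2 + \bar{y}^2 - 2\bar{x}\bar{y}}{2} + 2 \left[ \mu - \left( \frac{\bar{x} + \bar{y}}{2} \right) \right]^2$

    $= \frac{(\bar{x} - \bar{y})^2}{2} + 2 \left[ \mu - \left( \frac{\bar{x} + \bar{y}}{2} \right) \right]^2$.    (18)

Thus from equations (18) and (17)

    $\sum_{i=1}^{N} (y_i - \mu)^2 + \sum_{j=1}^{N} (x_j - \mu)^2$

    $= \sum_{i=1}^{N} (y_i - \bar{y})^2 + \sum_{j=1}^{N} (x_j - \bar{x})^2 + \frac{N}{2} (\bar{x} - \bar{y})^2 + 2N \left( \mu - \left( \frac{\bar{x} + \bar{y}}{2} \right) \right)^2$.    (19)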


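Also, since

    $\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi} \, (\sigma/\sqrt{2N})} \, e^{-\frac{1}{2\sigma^2/2N} \left( \mu - \frac{\bar{x} + \bar{y}}{2} \right)^2} d\mu = 1$,    (20)

we have

    $P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma) = C \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{2N} \frac{\sqrt{2\pi}\sigma}{\sqrt{2N}} \, e^{-\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N} (x_i - \bar{x})^2 + \sum_{i=1}^{N} (y_i - \bar{y})^2 + \frac{N}{2} (\bar{x} - \bar{y})^2 \right]}$.    (21)

Now to get the value of C we proceed as follows:

    $1 = C \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{2N} \frac{\sqrt{2\pi}\sigma}{\sqrt{2N}} \, e^{-\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N} (x_i - \bar{x})^2 + \sum_{i=1}^{N} (y_i - \bar{y})^2 + \frac{N}{2} (\bar{x} - \bar{y})^2 \right]} dy_1 \cdots dy_N$

    $= C \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{N} \frac{\sqrt{2\pi}\sigma}{\sqrt{N}} \, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \bar{x})^2}$.    (22)

Thus we have computed C to be

    $C = \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{-(N-1)} \sqrt{N} \, e^{\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \bar{x})^2}$.    (23)

We now can write the complete conditional density as

    $P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma)$

    $= \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{-(N-1)} \sqrt{N} \, e^{\frac{1}{2\sigma^2} \sum_{i=1}^{N} (x_i - \bar{x})^2} \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{2N} \frac{\sqrt{2\pi}\sigma}{\sqrt{2N}} \, e^{-\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N} (x_i - \bar{x})^2 + \sum_{i=1}^{N} (y_i - \bar{y})^2 + \frac{N}{2} (\bar{x} - \bar{y})^2 \right]}$

    $= \left( \frac{1}{\sqrt{2\pi}\sigma} \right)^{N} \frac{1}{\sqrt{2}} \, e^{-\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N} (y_i - \bar{y})^2 + \frac{N}{2} (\bar{x} - \bar{y})^2 \right]}$.    (24)

Thus we can use $2\sigma^2$ times the negative exponent of the conditional probability, as
given in equation (24), as the test statistic $\rho_X(Y)$. Notice that it is not symmetric in X
and Y.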


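    $\rho_X(Y) = f(P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma))$    (25)

    $= 2\sigma^2 \left[ -\log P(y_1, \ldots, y_N \mid x_1, \ldots, x_N, \sigma) - \frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2} \log(2) \right]$    (26)

    $= \sum_{i=1}^{N} (y_i - \bar{y})^2 + \frac{N}{2} (\bar{x} - \bar{y})^2$.    (27)

    $\rho_Y(X) = f(P(x_1, \ldots, x_N \mid y_1, \ldots, y_N, \sigma))$    (28)

    $= 2\sigma^2 \left[ -\log P(x_1, \ldots, x_N \mid y_1, \ldots, y_N, \sigma) - \frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2} \log(2) \right]$    (29)

    $= \sum_{i=1}^{N} (x_i - \bar{x})^2 + \frac{N}{2} (\bar{y} - \bar{x})^2$.    (30)

In order to get a symmetric test statistic, we can look at the product of the conditional
probabilities, so that

    $\rho_X(Y) + \rho_Y(X) = \sum_{i=1}^{N} (y_i - \bar{y})^2 + \frac{N}{2} (\bar{x} - \bar{y})^2 + \sum_{i=1}^{N} (x_i - \bar{x})^2 + \frac{N}{2} (\bar{y} - \bar{x})^2$.    (31)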

i= 1  i— 1 

But we know that the sum of the within-cluster scatter and the between-cluster scatter
is equal to the total scatter. Thus

    $\sum_{i=1}^{N} (y_i - \bar{y})^2 + \frac{N}{2} (\bar{x} - \bar{y})^2 + \sum_{i=1}^{N} (x_i - \bar{x})^2 = \sum_{i=1}^{N} \left( x_i - \frac{\bar{x} + \bar{y}}{2} \right)^2 + \sum_{i=1}^{N} \left( y_i - \frac{\bar{x} + \bar{y}}{2} \right)^2$.

Notice that for given data sets, the above summation is the same constant regardless of
which points go with $x_i$ and which with $y_i$. Thus

    $\rho_X(Y) + \rho_Y(X) = C + \frac{N}{2} (\bar{y} - \bar{x})^2$    (32)

where C is a constant. Thus a symmetric test statistic based on likelihood is

    $\rho(X, Y) = \frac{N}{2\sigma^2} (\bar{x} - \bar{y})^2$.    (33)


The reason for normalizing by $\sigma^2$ will become clear shortly.

Monte Carlo hypothesis tests can now be conducted with the distance functions $\rho$
defined in this section. In Figure 7 we show that the theoretically computed null distribution
agrees with the null distribution computed empirically by random permutations.

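It is important to statistically compare the test statistics $\rho_X(Y)$, $\rho_Y(X)$, and $\rho(X, Y)$
computed in this section. Notice that

    $\bar{x} \sim N(0, \sigma^2/N)$,
    $\bar{y} \sim N(0, \sigma^2/N)$,
    $\bar{x} - \bar{y} \sim N(0, 2\sigma^2/N)$.

Thus

    $\frac{(\bar{x} - \bar{y})^2}{2\sigma^2/N} \sim \chi^2_1$

and

    $\rho(X, Y) = \frac{N}{2\sigma^2} (\bar{x} - \bar{y})^2 \sim \chi^2_1$.    (34)

Thus $\rho(X, Y)$ has a mean of 1 and variance of 2. Similarly,

    $\frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - \bar{y})^2 \sim \chi^2_{N-1}$.

Thus

    $\frac{1}{\sigma^2} \left[ \sum_{i=1}^{N} (y_i - \bar{y})^2 + \frac{N}{2} (\bar{x} - \bar{y})^2 \right] \sim \chi^2_{N-1} + \chi^2_1$,

so that

    $\rho_X(Y)/\sigma^2 \sim \chi^2_N$.    (35)

We see that $\rho_X(Y)/\sigma^2$ has a mean of N and variance of 2N. This implies that $\rho(X, Y)$ is a
more powerful test statistic (in terms of false alarms) than $\rho_X(Y)$ or $\rho_Y(X)$.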
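
The moments quoted above can be checked by simulation. The following sketch is an
illustration under assumed settings (the values of `N`, `sigma`, and the number of trials
are arbitrary); it draws both samples from the same zero-mean Gaussian and compares the
empirical moments of the symmetric statistic and of the one-sided statistic scaled by
$1/\sigma^2$ with those of $\chi^2_1$ and $\chi^2_N$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma, trials = 10, 1.5, 40000        # arbitrary illustrative settings

x = rng.normal(0.0, sigma, size=(trials, N))
y = rng.normal(0.0, sigma, size=(trials, N))
xb, yb = x.mean(axis=1), y.mean(axis=1)

# Symmetric statistic: N / (2 sigma^2) * (xbar - ybar)^2, expected chi-square(1).
rho_sym = N / (2.0 * sigma**2) * (xb - yb) ** 2

# One-sided statistic divided by sigma^2, expected chi-square(N):
# [sum_i (y_i - ybar)^2 + (N/2) (xbar - ybar)^2] / sigma^2
within = ((y - yb[:, None]) ** 2).sum(axis=1)
rho_one = (within + 0.5 * N * (xb - yb) ** 2) / sigma**2

print("symmetric: mean %.3f, var %.3f" % (rho_sym.mean(), rho_sym.var()))
print("one-sided: mean %.3f, var %.3f" % (rho_one.mean(), rho_one.var()))
```

The one-sided statistic carries the extra within-sample scatter term, whose variance grows
with N while carrying no information about the difference of the means.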


6  Experimental  Protocol  and  Results 

In  this  section  we  outline  the  protocol  we  use  to  conduct  the  experiments.  Here  we  give 
all  the  sample  sizes  we  use,  the  numbers  of  trials  that  are  run  at  different  stages,  the 
exact  model  parameter  values  that  are  used  for  generating  the  synthetically  degraded 
characters,  the  impact  of  the  distance  functions,  etc.  Three  types  of  experiments  are 
possible: 

Synthetic vs. Synthetic: One sample X is synthetically created using the document
degradation model, with a fixed model parameter value. Then many samples Y
are generated, again using the model, but with different parameter settings. The
validation procedure can be run on the samples X and Y, and the power function
generated. This experiment is in part a sanity check for the methodology: if it does
not work on controlled synthetic data, there is little point in trying it on real data.

Real vs. Real: This experiment tests for systematic dissimilarities between two image
populations (e.g. rotations, fonts, etc.). Note that this use of the validation
procedure is independent of the degradation model.

Real vs. Synthetic: Here the sample X consists of real degraded characters and the
sample Y is generated by varying the degradation model parameter $\Theta$. The validation
procedure is run on the X and Y samples, and a power function is generated.
This experiment tests whether or not the synthetic characters are actually close to
the real characters.

6.1  Protocol  for  Synthetic  vs.  Synthetic 

The following protocol is used for creating the samples X and Y. The distribution
parameter $\Theta_X$ is fixed with the following parameter component values: the two $\eta$
components are set to 0, $\alpha_0 = \beta_0 = 1$, $\alpha = \beta = 1.5$, and structuring element size
$k = 5$. The distribution parameter $\Theta_Y$ is varied by varying $\alpha$ and $\beta$. In our
experiments we make $\alpha$ equal to $\beta$. The other parameter components of $\Theta_Y$ (the two
$\eta$ components, $\alpha_0$, $\beta_0$, and $k$) are made equal to the corresponding
components of the model parameter $\Theta_X$. In all cases the noise-free document is the same
(a LaTeX document page formatted in IEEE Transactions style) and the same set of 340
characters ‘e’ (Computer Modern Roman 10 point font) are extracted from the page to
create the samples X and Y.

The  validation  procedure  parameters  used  are  as  follows: 

1. Sizes of samples X and Y: $N = M \in \{10, 20, 60\}$.

2. Number of permutations: $K = 1000$.

3. Significance level of the test: $\epsilon = 0.05$.

4. Number of repetitions used in computing the power function: $T = 100$.

5. The character-to-character distance $\delta(x, y)$ used is the Hamming distance.

6. The set-to-set distance $\rho(X, Y)$ used is the mean nearest-neighbor distance.

The noise-free document is shown in Figure 8(a). The degraded document generated
with model parameter $\Theta_X$ is shown in Figure 8(b). The power functions for sample sizes
10, 20, 60 are shown in Figure 9. The power function corresponding to sample size 10 is
the widest, and the power function corresponding to sample size 60 is the narrowest. Note
that all three power functions give a misdetection (reject) rate close to $\epsilon = 0.05$ when
$\Theta_Y$ is close to $\Theta_X$. (Only the $\alpha$ component, which is equal to 1.5 for $\Theta_X$, is
shown in the plot.) Furthermore, when the $\alpha$ component for $\Theta_Y$ is far from 1.5, the
misdetection rate is close to 1.0, which implies that the validation procedure can distinguish
the two samples with high probability. An image generated with $\alpha = \beta = 1.7$ that the
validation procedure accepted with a probability close to 0.9 is shown in Figure 8(c). Two
document images generated with parameter values $\alpha = \beta = 2.0$ and $\alpha = \beta = 0.9$ that
are easily rejected by the validation procedure are shown in Figure 8(d) and Figure 8(e),
respectively.

6.2  Protocol  for  Real  vs.  Real  Experiment 

First, various European language texts are generated using the Adobe Times-Roman
typeface at 8 point. Next, these documents are printed on a Canon laser printer and
then scanned at 400 pixels per inch using a Canon scanner. Lower-case ‘e’s are extracted
semiautomatically by OCR (thus some characters possess artifacts resulting from
resegmentation). From among these, 3000 characters are selected by two persons working
independently to avoid misclassifications.

Before selecting the two populations, we randomly shuffle the real data in order to
obscure any systematic per-page dissimilarities (due to, for example, skew or scale
variations). The validation procedure does not reject the null hypothesis that the two samples
are from the same underlying population. Repeated trials give a reject rate close to 0.05,
the significance level designed into the test.

6.3  Outliers  and  Distance  Function  Comparisons 

The validation procedure protocol is as follows: the significance level $\epsilon$ is fixed at 0.05;
the sample sizes $N = M$ used are 10, 20, and 60; the number of permutations K for
creating the empirical null distribution is 1000; the number of trials T for estimating the
misdetection rate is 100.

We studied the sensitivity of the validation procedure to the set distance $\rho(X, Y)$ as
follows. The data sets X and Y are collections of (synthetic) degraded characters ‘e’.
The degradation parameter values for X are fixed at $\alpha = \beta = 1.5$, but the corresponding
degradation parameters for Y are varied from 0.6 to 2.4. The Hamming distance is used
for the character-to-character distance $\delta(x, y)$. The sample size of X and Y is fixed at
$N = M = 60$. The mean, trimmed mean and median distances are used to compute the
power function, in both the presence and absence of outliers.

Figures 10(a), 11(a), and 12(a) show the power functions in the absence of outliers
when the mean, trimmed mean, and median distances are used. Next, we introduced outliers
into the data set X by replacing five degraded ‘e’s with degraded ‘c’s. The Y data
set is unchanged. Figures 10(b), 11(b), and 12(b) show the power functions in the
presence of outliers. Clearly the median and trimmed mean nearest neighbor distances
are more robust against outliers, since the corresponding power functions are not affected.
Furthermore, it can be seen that the median NN distance function, in the outlier-free
case, is less “powerful” than the mean NN distance function since the median distance
power function plot lies below the mean distance power function plot. Finally, it can be
seen that the 10% trimmed NN distance function is superior to the other two distance
functions, since the corresponding power function is robust against outliers and at the
same time higher.


6.4  Protocol  for  Validating  Real  vs.  Synthetic  Degradations 

The ideal data is a LaTeX formatted document. The IEEE Transactions style is used
for typesetting the document. The corresponding ideal binary image and character
groundtruth are created using the DVI2TIFF software. The ideal document is created at
$300 \times 300$ dots/inch resolution; the size of the binary document in pixels is $3300 \times 2550$.
This document is printed using a SparcPrinter II. Next, the original printed document
is photocopied five times using a Xerox photocopier: once at the normal setting, twice
with darker settings, and twice with lighter settings. Finally the five photocopied
documents are scanned using a Ricoh scanner. The scanner is set at $300 \times 300$ dots/inch
resolution. The rest of the scanner parameters are set at normal settings. The scanned
binary image is of size $3307 \times 2544$. The parameters are then estimated using the
protocol specified in [12]. In all cases the noise-free document is the same (a LaTeX document
page formatted in IEEE Transactions style) and the same set of 340 characters ‘e’
(Computer Modern Roman 10 point font) is extracted from the page to create the synthetic
population Y.

The  validation  procedure  parameters  used  are  as  follows: 

1. Sample sizes of scanned characters X and synthetic characters Y:
$N = M \in \{10, 20, 60\}$.

2. Number of permutations for creating the empirical null distribution: $K = 1000$.

3. Significance level of the test: $\epsilon = 0.05$.

4. Number of bootstrap repetitions for computing the reject rate of the test: $T = 100$.

5. The bootstrap samples are sampled (with replacement) from a pool of size 100.

6. The character-to-character distance $\delta(x, y)$ used is the Hamming distance.

7. The set-to-set distance $\rho(X, Y)$ used is the mean nearest-neighbor distance.

The  above  test  was  conducted  on  ‘e’s.  The  test  did  not  reject  the  null  hypothesis  that 
the  samples  are  from  the  same  population  for  a  sample  size  of  10.  That  is,  the  reject  rate 
is  lower  than  5%.  For  the  sample  size  of  20,  46%  of  the  time  the  test  rejected  the  null 
hypothesis.  For  sample  size  of  60,  the  null  hypothesis  was  rejected  100%  of  the  time. 
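
The bootstrap estimate of the reject rate used in this protocol can be sketched as follows.
This is a hedged illustration, not the experimental code: the names `perm_pvalue`,
`bootstrap_reject_rate`, and `mean_gap` are hypothetical, and two scalar Gaussian pools
stand in for the pools of real and synthetic character bitmaps.

```python
import numpy as np


def perm_pvalue(x, y, stat, K, rng):
    """Permutation p-value: fraction of permuted distances >= observed one."""
    d0 = stat(x, y)
    z = np.concatenate([x, y])
    n = len(x)
    hits = 0
    for _ in range(K):
        p = rng.permutation(z)               # shuffled copy of the pooled data
        hits += stat(p[:n], p[n:]) >= d0
    return hits / K


def bootstrap_reject_rate(pool_x, pool_y, stat, n,
                          T=100, K=1000, eps=0.05, seed=0):
    """Fraction of T bootstrap repetitions in which the test rejects."""
    rng = np.random.default_rng(seed)
    rejects = 0
    for _ in range(T):
        # Draw n items with replacement from each pool.
        x = pool_x[rng.integers(0, len(pool_x), size=n)]
        y = pool_y[rng.integers(0, len(pool_y), size=n)]
        rejects += perm_pvalue(x, y, stat, K, rng) < eps
    return rejects / T


def mean_gap(x, y):
    """Toy test statistic: absolute difference of sample means."""
    return abs(x.mean() - y.mean())


# Stand-in pools of size 100 (assumption: scalars instead of bitmaps).
rng = np.random.default_rng(42)
real_pool = rng.normal(0.0, 1.0, size=100)
synth_pool = rng.normal(3.0, 1.0, size=100)
rate = bootstrap_reject_rate(real_pool, synth_pool, mean_gap, n=20, T=20, K=100)
```

Running the same function for n = 10, 20, and 60 gives the kind of reject-rate-versus-sample-size
comparison reported above.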

6.5  Comparing  Two  Models 

In the previous section we used a two-sample permutation procedure to test the null
hypothesis that the sample of real degraded characters and the sample generated by the
estimated degradation model are from the same underlying population. We found that
when the sample size is 60, the test procedure rejects the null hypothesis.

In fact, in a two-sample test, if one of the samples is from a distribution that is even
slightly different from the second sample’s distribution, the statistical testing procedure
will be able to reject the null hypothesis that the samples are from the same underlying
population if the sample size is large enough.

Since  we  know  that  any  model  of  a  real  process,  with  very  high  likelihood,  is  an 
approximation  to  the  real  process,  the  samples  generated  from  the  model  will  be  different 
from  the  real  samples.  Thus  any  validation  procedure  will  be  able  to  distinguish  the  real 
and  synthetic  samples  if  the  sample  sizes  are  large  enough.  In  other  words,  it  is  futile 
to  test  the  equality  of  the  distributions  of  the  synthetic  samples  and  the  real  samples; 
they  will  always  be  proved  to  be  unequal  if  a  large  enough  sample  size  is  used.  Even  if 
some  other  validation  procedure  is  used,  for  example  any  method  based  on  comparison 
of  confusion  matrices,  the  equality  test  is  always  going  to  give  a  negative  result  when  the 
sample  size  is  made  large  enough. 

The  next  question  is:  How  can  one  use  the  validation  procedure  in  practice  if  the 
models  are  always  going  to  be  proved  incorrect?  The  way  to  use  the  validation  procedure 
is  to  compare  two  models  and  not  evaluate  just  one  model.  That  is,  one  can  use  the 
validation  procedure  to  determine  which  model  is  closer  to  reality. 

Suppose we have two document degradation models $M_1$ and $M_2$. The problem is to
find the model that is closer to the real process. We know that if the sample size N of
the synthetic and real samples is increased, after a certain point the validation procedure
will start rejecting both models. However, we will now give a procedure that will allow
a researcher to decide which model is closer to reality for a fixed sample size N.

1. Fix the sample size N.

2. We are given a real sample D of size N.

3. Generate synthetic samples $S_1$ and $S_2$ of size N using the models $M_1$ and $M_2$
respectively.

4. Conduct the two-sample validation test using the real sample D and the synthetic
sample $S_1$. Let the associated p-value be $p_1$.

5. Conduct the two-sample validation test using the real sample D and the synthetic
sample $S_2$. Let the associated p-value be $p_2$.

6. If $p_1 > p_2$, model $M_1$ is closer to the real process for a sample size of N. Otherwise
model $M_2$ is closer.

Thus the above procedure allows a researcher to choose between models. When we
were choosing between parameter settings for a fixed model, we could use the power
function to arrive at the best parameter setting. However, two different models have
different parameter spaces and hence cannot be compared using power functions. The
p-value provides a means of comparing the models on a common basis.
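
The six-step comparison procedure above can be sketched in Python as follows. This is a
minimal illustration under stated assumptions, not the implementation used for the
experiments: scalar Gaussian data stand in for sets of degraded characters, the absolute
difference of sample means stands in for the set distance function, and the names
permutation_p_value, closer_model, mean_gap and gaussian_model are hypothetical. The
add-one correction in the p-value estimate is likewise a choice made here, not something
taken from the text.

```python
import random


def permutation_p_value(x, y, distance, num_permutations=1000, rng=None):
    """Right-tailed two-sample permutation test on distance(x, y)."""
    rng = rng or random.Random(0)
    observed = distance(x, y)
    pooled = list(x) + list(y)
    n = len(x)
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # random relabelling of the pooled sample
        if distance(pooled[:n], pooled[n:]) >= observed:
            count += 1
    # Add-one correction (an assumption here) keeps the estimate positive.
    return (count + 1) / (num_permutations + 1)


def closer_model(real, model_1, model_2, distance, rng):
    """Steps 1-6: the model with the larger p-value is taken as closer."""
    n = len(real)                       # steps 1-2: N is the real sample size
    s1 = model_1(n, rng)                # step 3: synthetic samples of size N
    s2 = model_2(n, rng)
    p1 = permutation_p_value(real, s1, distance, rng=rng)   # step 4
    p2 = permutation_p_value(real, s2, distance, rng=rng)   # step 5
    return ("M1" if p1 > p2 else "M2"), p1, p2              # step 6


def mean_gap(x, y):
    """Toy distance: absolute difference of sample means."""
    return abs(sum(x) / len(x) - sum(y) / len(y))


def gaussian_model(mu):
    """Toy 'model': draws n values from a unit-variance Gaussian."""
    return lambda n, rng: [rng.gauss(mu, 1.0) for _ in range(n)]


rng = random.Random(42)
real = [rng.gauss(0.0, 1.0) for _ in range(200)]
choice, p1, p2 = closer_model(
    real, gaussian_model(0.05), gaussian_model(1.0), mean_gap, rng
)
print(choice, p1, p2)
```

In this toy setting the first model is off by 0.05 and the second by 1.0, so the second
model's p-value is expected to be close to its minimum possible value. Both p-values are
computed against the same real sample and the same N, which is what puts the two models
on a common basis.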

7  Conclusions


We have posed the degradation model validation problem as a statistical, two-sample
hypothesis testing problem. A non-parametric permutation test is used for this purpose.
The user specifies a test statistic, which is essentially a distance function on the two
sets of degraded characters. The null distribution of the test statistic, which is the
distribution of the test statistic under the hypothesis that the two samples come from
the same underlying population, is created using a permutation procedure. The p-value
corresponding to the test statistic associated with the two sets is computed and compared
with a user-specified significance level to reject or not reject the null hypothesis. This
procedure and several robust variants are implemented and evaluated empirically. The
goodness of the distance functions is evaluated using power functions, which are standard
statistical devices. The local degradation model passes the validation test when the
sample size is small but is rejected when the sample size is increased. This is so because
any model of a real-world process is an approximation and thus will not pass the test
once the sample size is large enough. Another way of using the validation procedure is
for choosing between models. Each run of the validation procedure yields a p-value.
Thus if two different models are tested on the same real data, we obtain one p-value
for each model. The model whose associated p-value is larger is in closer agreement
with the real data and thus should be preferred.

Figure 7: Empirical and theoretical null distributions for two-sample tests. Samples
X and Y of size N = 75 are drawn from N(15, 1). The empirical null distribution is
computed as described in Section 4. We use 1000 random permutations for computing
the distribution. The distance function used is
$t = \rho(X, Y) = N(\bar{x} - \bar{y})^2 / (2\sigma^2)$. The theoretical distribution
of t is $\chi^2_1$. The empirical and theoretical distributions are plotted together
in this figure.
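
The comparison shown in Figure 7 can be reproduced in outline with the following sketch.
It assumes that the empirical null distribution is obtained by repeatedly shuffling the
pooled sample and recomputing t, which is our reading of the permutation procedure. The
names t_statistic and chi2_1_cdf are hypothetical, and the closed form erf(sqrt(t/2)) for
the chi-square CDF with one degree of freedom is a standard identity rather than
something stated in the text. The value of the mean does not affect t.

```python
import math
import random

N = 75                    # size of each sample
NUM_PERMUTATIONS = 1000   # random permutations used for the null distribution
MU, SIGMA = 15.0, 1.0     # population mean and standard deviation


def t_statistic(x, y):
    """t = N * (xbar - ybar)^2 / (2 * sigma^2), with sigma assumed known."""
    n = len(x)
    diff = sum(x) / n - sum(y) / n
    return n * diff * diff / (2.0 * SIGMA ** 2)


def chi2_1_cdf(t):
    """CDF of chi-square with 1 d.o.f.: P(Z^2 <= t) = erf(sqrt(t / 2))."""
    return math.erf(math.sqrt(t / 2.0))


rng = random.Random(1)
x = [rng.gauss(MU, SIGMA) for _ in range(N)]
y = [rng.gauss(MU, SIGMA) for _ in range(N)]

# Empirical null distribution: shuffle the pooled data and recompute t.
pooled = x + y
null_t = []
for _ in range(NUM_PERMUTATIONS):
    rng.shuffle(pooled)
    null_t.append(t_statistic(pooled[:N], pooled[N:]))
null_t.sort()

# Evaluate the theoretical CDF at a few empirical quantiles; if the two
# distributions agree, the CDF value should be close to the quantile level.
for q in (0.25, 0.50, 0.75, 0.95):
    t_q = null_t[int(q * NUM_PERMUTATIONS) - 1]
    print(f"empirical {q:.2f}-quantile: t = {t_q:.3f}, "
          f"chi-square CDF = {chi2_1_cdf(t_q):.3f}")
```

Exact agreement should not be expected: the permutation distribution is conditional on
the particular pooled sample, whose variance differs somewhat from sigma^2 at this
sample size.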

Figure 8: Local document degradation model. (a) Subimage of the noise-free document.
(b) Degraded document generated with $\alpha = \beta = 1.5$. (c) A degraded image accepted
as similar to (b), $\alpha = \beta = 1.7$. (d) A degraded image rejected as similar to (b),
$\alpha = \beta = 0.9$. (e) A degraded image rejected as similar to (b),
$\alpha = \beta = 2.0$. The sample size used is 60.


Figure 9: Power plots for the local document degradation model. The parameters for X
were fixed at $\alpha = \beta = 1.5$, while the parameters for Y were varied. Notice that
the power function has a minimum near $\alpha = \beta = 1.5$. The power function
corresponding to a sample size of 60 (boxes) is sharper; that corresponding to a sample
size of 10 (crosses) is broader.

Figure 10: Power functions of the validation procedure when mean nearest neighbor
distance is used for the set distance function $\rho(X, Y)$. Panel (a) shows the case
with no outliers. Panel (b) corresponds to the situation when there are five outliers in
one of the data sets.

Figure 11: Power functions of the validation procedure when median nearest neighbor
distance is used for the set distance function $\rho(X, Y)$. Panel (a) shows the case
with no outliers. Panel (b) corresponds to the situation when there are five outliers in
the X data set.

Figure 12: Power functions of the validation procedure when 10% trimmed mean nearest
neighbor distance is used for the set distance function $\rho(X, Y)$. Panel (a) shows the
case with no outliers. Panel (b) corresponds to the situation when there are five
outliers in the X data set.